Nature Machine Intelligence
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match Nature Machine Intelligence's content profile, based on 61 papers previously published here. The average preprint has a 0.13% match score for this journal, so anything above that is already an above-average fit.
Shi, M.; Zheng, H.; Gottumukkala, R.; Jonathan, N.; Armstong, G. W.; Shen, L. Q.; Wang, M.
Early screening for glaucoma and diabetic retinopathy (DR) is critical to prevent irreversible vision loss, yet remains inaccessible to many underserved populations. However, AI models trained on hospital-grade fundus images often generalize poorly to low-cost images acquired with portable devices such as smartphones. We propose CausalFund, a causality-inspired learning framework for training AI models that enable reliable low-resource screening from easily acquired non-clinical images. CausalFund disentangles disease-relevant retinal features from spurious image factors to achieve domain-generalizable screening across clinical and non-clinical settings. We integrated CausalFund with seven deep learning backbones for glaucoma and DR screening from portable-device fundus images, including lightweight architectures suitable for on-device deployment. Across diverse experimental settings and image quality conditions, CausalFund consistently improved AUC and achieved a more favorable sensitivity-specificity trade-off than conventional deep learning baselines. As a model-agnostic framework, CausalFund could be extended to other diseases and low-resource scenarios characterized by degraded or non-standard imaging.
Mascart, C.; Tran, K.; Samoilova, K.; Storan, L. T.; Liu, T.; Koulakov, A.
Recent advances in deep learning have enabled prediction of odorant perception from molecular structure, opening new avenues for odor classification. However, most existing models are limited to predicting percepts from fixed vocabularies and fail to capture the full richness of olfactory experience. Progress is further limited by the scarcity of large-scale olfactory datasets and the lack of standardized metrics for evaluating free-form natural-language odor descriptions. To address these challenges, we introduce Odor Description and Inference Evaluation Understudy (ODIEU), a benchmark which includes perceptual descriptions of over 10,000 molecules paired with a model-based metric for evaluating free-form odor text descriptions. The model-based metric uses Sentence-BERT (SBERT) models which are fine-tuned on olfactory descriptions to allow better evaluation of human-generated odor texts. Using the fine-tuned SBERT models, we show that free-form text odor descriptions contain additional perceptual information in their syntactic structure compared to semantic labels. We further introduce CIRANO (Chemical Information Recognition and Annotation Network for Odors), a transformer-based model that generates free-form odor descriptions directly from molecular structure, thus implementing the molecular structure-to-text (S2T) prediction. CIRANO achieves performance comparable to humans. Finally, we generate human-like descriptions from mouse olfactory bulb neural data using an invertible SBERT model, yielding neural-to-text (N2T) predictions highly aligned with human descriptions. Together, CIRANO and ODIEU establish a standardized framework for generating natural language olfactory descriptions and evaluating their alignment with human perception. Code is available at https://github.com/KoulakovLab/ODIEU
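The model-based evaluation in ODIEU boils down to comparing a candidate free-form description against a reference in an embedding space. A minimal sketch of that idea, using a toy bag-of-words embedding and cosine similarity as a stand-in for the fine-tuned SBERT encoder (all names and examples here are illustrative, not from the benchmark):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; a stand-in for a fine-tuned SBERT encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def odor_match_score(reference, candidate):
    # Score a free-form candidate odor description against a reference.
    return cosine(embed(reference), embed(candidate))

ref = "sweet floral with a hint of citrus"
print(odor_match_score(ref, "floral and sweet slightly citrus"))  # high
print(odor_match_score(ref, "burnt rubber and sulfur"))           # low
```

In the actual metric the bag-of-words encoder would be replaced by sentence embeddings, which is what lets syntactic structure (word order, modifiers) contribute beyond bare semantic labels.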
HOU, Z.; Lee, V. H.-F.; Kwong, D. L.-W.; Guan, X.; Liu, Z.; Dai, W.
The advent of artificial intelligence (AI) has brought revolutionary tools for biomedical transcriptomic (RNA-level) research. However, persistent constraints remain, including limited interpretation in terms of biomedical concepts such as functional pathways, small sample sizes, and the substantial time and computing power required for AI training. To overcome these limitations, we developed RNAGAN (https://github.com/ZhaozhengHou-HKU/RNAGAN-1.0.git), an AI tool built on a generative adversarial network (GAN) architecture designed to enhance transcriptomic analysis. The network was established based on public human datasets comprising 4.6 million single cells from multiple organs and 5,900 sequenced samples of various cancer types with normal references. A specialized pathway neural layer was embedded to extract activities of predefined pathways from the Human Molecular Signatures Database (MSigDB), or of newly learned pathways from single-cell data. The structure of RNAGAN (generator and discriminator) enables four applications after one shared training procedure: (1) single-cell and bulk-level patient stratification or differential diagnosis; (2) analysis of gene and pathway markers in a selected disease; (3) pseudo-data generation when sample size is limited for downstream analysis; and (4) vectorization with gene- and pathway-level features learned from multiple datasets. RNAGAN contributes to the efficient utilization of limited data for transcriptomic studies.
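A pathway neural layer like the one described can be read as a masked linear map: each pathway unit aggregates only its member genes. A minimal sketch under that reading, with illustrative gene, pathway, and weight values rather than real MSigDB content:

```python
def pathway_activities(expression, memberships, weights):
    # expression: {gene: value}; memberships: {pathway: set of member genes};
    # weights: {(pathway, gene): learned weight}. Each pathway unit only sees
    # its member genes -- the "mask" of the pathway layer.
    acts = {}
    for pw, genes in memberships.items():
        acts[pw] = sum(weights.get((pw, g), 0.0) * expression.get(g, 0.0)
                       for g in genes)
    return acts

expr = {"TP53": 1.2, "MYC": 0.8, "BRCA1": 0.5}
member = {"apoptosis": {"TP53", "BRCA1"}, "proliferation": {"MYC"}}
w = {("apoptosis", "TP53"): 1.0, ("apoptosis", "BRCA1"): 0.5,
     ("proliferation", "MYC"): 2.0}
print(pathway_activities(expr, member, w))
```

Because the membership mask is fixed, each learned activity stays attached to a named pathway, which is what makes the layer interpretable.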
Dee, W.; Wenteler, A.; Seal, S.; Morris, O.; Slabaugh, G.
Pervasive batch effects are a common issue, especially in recent large-scale Cell Painting datasets, which have been produced to aid AI-enhanced drug discovery efforts. Technical differences arising from experiments carried out in different batches can cause models to fail to generalize to unseen batches, despite good predictive performance "within batch". We propose a biologically grounded test-time adaptation framework, SHOT-CCR, which uses cell-invariant gradient reversal to decouple morphological signal from experimental confounders. Our approach performs 4.5% better than the current RxRx1 benchmark, classifying 1,139 classes of siRNA genetic perturbations with 91.6% accuracy. We deliver consistent results over four distinct cell types and two prominent Cell Painting datasets, RxRx1 and a subset of JUMP-CP. Across 484 classes of CRISPR perturbations in JUMP-CP, our method improves accuracy by 15.7%.
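The gradient-reversal idea behind this kind of adaptation is simple to state: the layer is the identity in the forward pass but flips (and scales) the gradient in the backward pass, so the feature extractor learns to confuse a batch/domain classifier. A conceptual sketch of just that mechanic, not the authors' implementation:

```python
class GradReverse:
    # Identity in the forward pass; gradient scaled by -lam in the backward
    # pass, so whatever the downstream (batch/domain) classifier learns, the
    # upstream feature extractor is pushed in the opposite direction.
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, upstream_grad):
        return -self.lam * upstream_grad  # flipped and scaled

grl = GradReverse(lam=0.5)
print(grl.forward(2.0))     # 2.0 -- unchanged
print(grl.backward(0.8))    # -0.4 -- reversed
```

In an autograd framework this is a one-line custom backward; writing it out makes clear that the only trainable effect is the sign flip on the confounder gradient.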
Wang, X.; Tan, R.; Cristea, S.
Cancer evolution is driven by complex changes in gene expression as cells transition between states during tumorigenesis. Single-cell RNA sequencing has provided snapshot insights into how the transcriptomics of tumors evolve, but whether existing knowledge can be used to reliably learn and generate the patterns behind the evolution of cancers remains unknown. Here, we introduce evoCancerGPT, a decoder-only generative pre-trained transformer single-cell foundation model designed to forecast future gene expression profiles in cancer evolution by leveraging previous cell states at the level of single patients. This model integrates the continuous gene expression data of each cell to create a comprehensive representation of a cell token. Training sentences are constructed for each cancer type, each patient, and each cell type separately, ordered via inferred pseudotime algorithms, using 2.76 million cell tokens, each with 12,639 genes, spanning 7 cancer types. By learning long-range dependencies between cells arranged in pseudotime from a large corpus of data, evoCancerGPT captures key transitions in cancer evolution, achieving high concordance with ground-truth trajectories and outperforming linear and scGPT baselines on held-out test samples in low-context scenarios. Our work suggests evoCancerGPT's potential utility in characterizing tumor progression at a single-cell and single-patient level, ultimately contributing to more personalized cancer care.
Baek, B.; Jang, E.; Kim, Y.; Kang, M.
Motivation: Knowledge-guided learning offers effective and robust model-training strategies in data-scarce settings by incorporating established domain knowledge, thereby enhancing generalization, robustness, and interpretability. By contrast, conventional deep learning approaches rely purely on data-driven learning, which can limit robust model interpretability, particularly in high-dimensional settings with limited sample sizes. In computational biology, knowledge-guided learning has primarily leveraged network- and structure-based knowledge, leading to biologically interpretable representations and enhanced predictive performance compared to conventional approaches. However, curated biomarkers, one of the most accessible forms of biological knowledge, remain largely unexplored within knowledge-guided paradigms. Results: In this study, we propose a model-agnostic training paradigm, Biomarker-driven Explainable Prior-guided Learning (BioExPL), which incorporates curated prior knowledge and can be applied to any neural network. BioExPL enforces neural networks to reflect curated biomarker priors in their latent representations through a novel knowledge-alignment loss. BioExPL consistently demonstrated significantly improved predictive performance and enhanced model interpretability with minimal computational overhead in simulation studies and extensive experiments on multiple cancer datasets. BioExPL not only integrates curated prior knowledge into the model but also accurately identifies unknown associated signals. BioExPL is model-agnostic and domain-independent, enabling its integration into diverse neural network architectures. Availability and implementation: The open-source implementation is publicly available at: https://github.com/datax-lab/BioExPL.
Lin, J. B.; Mataraso, S. J.; Chadha, M.; Velez, G.; Mruthyunjaya, P.; Aghaeepour, N.; Mahajan, V. B.
Purpose: There is a need for novel therapies for diabetic retinopathy (DR) because existing therapies treat only certain features of DR and do not work optimally for all patients. While proteomic studies provide insight into disease pathobiology, they are often limited to small sample sizes due to high costs, limiting their generalizability and reproducibility. Moreover, they often yield lists of tens to hundreds of differentially expressed proteins, making it difficult to prioritize the most biologically relevant biomarkers beyond using arbitrary fold-change and false-discovery-rate cutoffs. Here, we applied a two-stage multimodal AI approach: first, we integrated EHR and proteomics data to rationally prioritize candidate protein biomarkers and, next, validated these biomarkers in an independent cohort. These protein biomarkers of DR are rooted in the EHR data and thereby more likely to be biological drivers of disease. Methods: We obtained EHR data from a large number of patients with and without DR (N=319,997) from the STARR-OMOP database and obtained aqueous humor liquid biopsies from a subset of these patients (N=101) for high-resolution proteomic profiling. We developed Clinical and Omics Multi-Modal Analysis Enhanced with Transfer Learning (COMET) to perform integrated analysis of proteomics and all available EHR data to identify protein biomarkers of DR. The model was trained in two phases: first, it was pretrained using patients with EHR data alone (N=319,896), and then it was fine-tuned using patients with both EHR and proteomics data (N=101), allowing it to learn both clinical and molecular features associated with DR. Findings from COMET were then validated with liquid biopsies from an independent validation cohort (N=164). Results: t-distributed stochastic neighbor embedding (t-SNE) analysis of EHR and proteomics data identified proteins clustering with related EHR features.
Levels of STX3 and NOTCH2, proteins involved in retinal function, were correlated with a diagnosis of macular edema, a record of a visual field exam, and a prescription for latanoprost, highlighting protein-EHR alignment. The pretrained, multimodal COMET model (AUROC=0.98, AUPRC=0.91) outperformed models generated using either EHR or proteomics data alone or without pretraining (AUROC: 0.76 to 0.92; AUPRC: 0.47 to 0.74). The proteins SERPINE1, QPCT, AKR1C2, IL2RB, and SRSF6 were prioritized by the COMET model relative to the models without pretraining, supporting their potential role in DR pathobiology, and were subsequently validated in an independent cohort. Conclusion: We used multimodal AI to prioritize protein biomarkers of DR that are most strongly linked to EHR elements, as well as to identify other protein biomarkers associated with disease features such as diabetic macular edema. These findings serve as a foundation for future mechanistic studies and highlight the synergistic value of using multimodal AI to fuse EHR and proteomics data for enhanced proteomics analysis.
Lu, H.-E.; Koivisto, D.; Lou, Y.; Zeng, Z.; Yu, T.; Wang, J.; Meng, X.; Nowikow, C.; Wilson, R.; Kumbhare, D.; Pu, J.
Deep learning has transformed medical image and video analysis, but it usually requires large, well-annotated datasets. In many clinical domains, especially when testing novel mechanistic hypotheses, such retrospective datasets are hard to obtain since acquiring adequate cohorts is time-intensive, costly, and operationally difficult. This creates a critical translational gap: scientifically compelling early-stage ideas may remain untested for lack of sufficient sample size to support conventional deep learning pipelines. Developing data-efficient strategies for evaluating new hypotheses within small prospective cohorts is therefore essential to de-risk innovation before large-scale validation. Myofascial Pain Syndrome (MPS) exemplifies this challenge, as quantitative ultrasound imaging biomarkers for MPS remain underexplored. We investigated whether MPS in the upper trapezius can be detected from full B-mode ultrasound videos in a small prospective cohort (11 controls, 13 patients). Videos were automatically preprocessed and resampled using a sliding-window strategy to expand the training set (404 clips). We developed a self-supervised Video Diffusion Encoder (VDE) to learn spatiotemporal representations without relying on extensive labeled data, and compared it with transfer-learning-based ResNet, VideoMAE, and SimCLR. Using subject-level stratified four-fold cross-validation, the VDE outperformed transfer learning baselines and achieved performance comparable to SimCLR, with a subject-level AUC of 0.79 and accuracy of 0.86, and no significant differences between latent-only and combined trigger-point analyses. These results demonstrate that self-supervised diffusion learning can support robust, data-efficient deep learning in small prospective studies, enabling early feasibility testing of innovative ultrasound biomarkers before large-scale clinical trials.
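Sliding-window resampling of the kind used to expand a small video cohort amounts to enumerating fixed-length clip start indices at a chosen stride. A minimal sketch (window and stride values are illustrative, not taken from the paper):

```python
def sliding_windows(n_frames, window, stride):
    # Start indices of fixed-length clips covering an n_frames video.
    return list(range(0, n_frames - window + 1, stride))

# e.g. a 100-frame video cut into 32-frame clips with stride 16
starts = sliding_windows(100, 32, 16)
print(starts)       # [0, 16, 32, 48, 64]
```

With overlapping strides, a single recording yields many partially redundant clips, which is how 24 subjects can produce hundreds of training samples; subject-level cross-validation then keeps those overlapping clips from leaking across folds.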
Shen, L.; Chao, L.; Liu, T.; Liu, Q.; Zhou, G.; Wang, H.; Dong, X.; Li, T.; Zhang, X.; Ni, J.
While protein language models typically rely on sequence-only pretraining objectives, this approach often fails to capture structural regularities and demands large amounts of computation. To address this, we introduce ProteinSage, a pretraining framework that learns protein representations under explicit structural constraints. ProteinSage incorporates structural signals via structure-guided masking and a causal objective designed to model long-range dependencies. This structure-constrained pretraining endows ProteinSage with highly transferable representations that achieve superior performance across diverse structure-aware and general protein modeling benchmarks, while requiring substantially less computation. To determine whether these gains stem from genuine structural generalization rather than task-specific fitting, we applied ProteinSage to a structure-driven protein discovery task, focusing on proteins with multi-pass transmembrane helical architectures such as distantly related microbial rhodopsins. The model successfully identified six previously unannotated microbial rhodopsin homologs. Together, our work establishes structure-constrained pretraining as an effective pathway toward data-efficient and structurally faithful protein representation learning.
Kong, Z.; Zhu, Y.; Xu, Y.; Yin, M.; Hou, T.; Wu, J.; Xu, H.; Hsieh, C.-Y.
Designing protein sequences with desired properties is a fundamental task in protein engineering. Recent advances in deep generative models have greatly accelerated this design process. However, most existing models face the issue of distribution centralization and focus on local compositional statistics of natural sequences instead of the global semantic organization of protein space, which confines their generation to specific regions of the distribution. These problems are amplified for functional proteins, whose sequence patterns strongly correlate with semantic representations and exhibit a long-tailed functional distribution, causing existing models to miss semantic regions associated with rare but essential functions. Here, we propose ProtFlow, a generative model designed for comprehensive semantic distribution learning of protein sequences, enabling high-quality sequence generation. ProtFlow employs a rectified flow matching algorithm to efficiently capture the underlying semantic distribution of the protein design manifold, and introduces a reflow technique enabling one-step sequence generation. We construct a semantic integration network to reorganize the representation space of large protein language models, facilitating stable and compact incorporation of global protein semantics. We pretrain ProtFlow on 2.6M peptide sequences and fine-tune it on antimicrobial peptides (AMPs), a representative class of therapeutic proteins exhibiting unevenly distributed activities across pathogen targets. Experiments show that ProtFlow outperforms state-of-the-art methods in generating high-quality peptides, and AMPs with desirable activity profiles across a range of pathogens, particularly against underrepresented bacterial species. These results demonstrate ProtFlow's effectiveness in capturing the full training distribution and its potential as a general framework for computational protein design.
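Rectified flow matching, which ProtFlow builds on, learns a velocity field along straight-line paths between noise and data; because the exact velocity on such a path is constant, a single Euler step recovers the endpoint, which is the property reflow-style one-step generation exploits. A 1D numerical sketch of that property (toy scalars, not protein embeddings):

```python
def velocity(x0, x1):
    # On a straight path x_t = (1 - t) * x0 + t * x1 the velocity is constant.
    return x1 - x0

def euler_generate(x0, x1, n_steps):
    # Integrate the (exact) velocity field from t = 0 to t = 1.
    x, dt = x0, 1.0 / n_steps
    for _ in range(n_steps):
        x += velocity(x0, x1) * dt
    return x

x0, x1 = -1.0, 3.0   # toy "noise" and "data" points
print(euler_generate(x0, x1, 1))    # a single step already lands on x1
print(euler_generate(x0, x1, 10))   # ~same endpoint, up to float error
```

In practice the velocity field is a learned network, so paths are only approximately straight; reflow re-trains on the model's own (noise, sample) pairs to straighten them until one step suffices.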
Choi, D.; Yip, C.; Choi, A.; Park, J.
Synthetic augmentation can silently harm subject-disjoint EEG generalization. We propose trust-gated augmentation (TGA), a control layer that scores synthetic windows with a teacher trained on real data for label consistency and confidence; only samples above a confidence quantile q are eligible. A fail-closed selector injects synthetic data only if validation AUROC exceeds the real-only baseline by a margin, otherwise reverting to real-only. In PainMunich chronic-pain EEG (n = 189) at 5% subject scarcity, ungated augmentation harmed 56% of paired runs (ΔAUROC < -0.01), whereas TGA at q = 0.99 reduced harm to 24% with comparable mean AUROC. In BCI IV-2a motor imagery (n = 9) at 25% scarcity, strict gating improved AUROC (0.679 vs 0.627) and reduced harm (0.16 vs 0.44). A covariance-manifold audit showed synthetic windows were strongly off-manifold (mean distance ratio 2.39 × 10^4), motivating explicit governance.
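The two control mechanisms described, confidence-quantile gating and the fail-closed selector, can be sketched directly from the text. All sample values and the AUROC numbers below are illustrative:

```python
def trust_gate(samples, confidences, q):
    # Keep only synthetic samples at or above the q-th confidence quantile.
    ranked = sorted(confidences)
    cut = ranked[min(int(q * len(ranked)), len(ranked) - 1)]
    return [s for s, c in zip(samples, confidences) if c >= cut]

def fail_closed(real_auc, aug_auc, margin=0.01):
    # Inject synthetic data only when it clearly beats real-only.
    return "augmented" if aug_auc > real_auc + margin else "real-only"

synth = ["w1", "w2", "w3", "w4"]
conf = [0.55, 0.92, 0.99, 0.61]
print(trust_gate(synth, conf, q=0.75))              # only the top window(s)
print(fail_closed(real_auc=0.627, aug_auc=0.679))   # augmented
print(fail_closed(real_auc=0.700, aug_auc=0.705))   # real-only
```

The key design point is the default: when the gated comparison is ambiguous, the pipeline falls back to real-only data, so augmentation can only ever be an opt-in improvement.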
Banerjee, P.; Friedberg, I.; Rued, B. E.; Eulenstein, O.
Motivation: Antibiotic-resistant infections in humans and animals are rising, creating an urgent need for new antimicrobial strategies. This challenge extends beyond clinical settings to food production systems; the Centers for Disease Control and Prevention estimates that foodborne pathogens cause over 48 million illnesses annually in the U.S. alone. Antimicrobial peptides (AMPs) are a promising alternative, with broad activity and lower risk of resistance. However, rational design remains challenging, especially when simultaneously controlling sequence, function, and peptide length. Results: We introduce Peptide modeLs for Understanding and engineering antiMicrobial therapeutics (PLUM), a structured conditional variational autoencoder for controlled AMP generation. PLUM disentangles sequence, function, and length in its latent space, enabling de novo and prototype-conditioned generation of peptides 5-35 amino acids long, allowing capture of larger functional domains. Across 45,000 generated peptides, PLUM achieved the highest AMP yield (0.885, 7% higher than HydrAMP) and increased AMP diversity (14% higher than HydrAMP), while maintaining the highest non-AMP sequence yield (0.895, 19% higher than HydrAMP). For prototype-conditioned generation, PLUM produced 37% more AMPs than HydrAMP, generating sequences that closely matched real peptide compositions with low predicted toxicity. Integrated AMP classifiers enabled robust evaluation of identity and potency across diverse bacteria. These results establish PLUM as a scalable, versatile platform for designing AMPs and next-generation therapeutics. Availability: https://github.com/priyamayur/PLUM Contact: pb11@iastate.edu, idoerg@iastate.edu, brued@iastate.edu, oeulen@iastate.edu
Peddi, N.; Bijjula, D. R.; Gogte, S.; Kondaparthi, V.
Major Histocompatibility Complex (MHC) molecules are essential to the immune system because they bind and present peptide antigens to T cells, enabling immune recognition and response. The specificity of MHC-peptide interactions is crucial for understanding immune-related diseases, developing personalized immunotherapies, and designing effective vaccines. Current computational methods, while powerful, often rely on a single type of molecular information, usually sequence, and only implicitly model the interaction between the two molecules. To address these limitations, we introduce MHC-Bind, a novel deep learning framework that captures a more comprehensive and biologically relevant view of the binding event. MHC-Bind's architecture employs a dual-view feature extraction strategy for both the MHC and the peptide. A Graph Attention Network (GAT) learns topological features from predicted residue contact maps, while a parallel 1D Convolutional Neural Network (CNN) captures multi-scale patterns from sequence embeddings. These four distinct feature sets are then integrated in a cross-fusion module that uses an attention mechanism to model interactions between the two molecules. Finally, a multi-layer perceptron (MLP) regression head maps the fused interaction signature to a precise binding affinity score. In rigorous comparative benchmarks against recent tools such as NetMHCpan, MHCFlurry, and MHCnuggets, MHC-Bind demonstrates superior performance, achieving a significantly lower average prediction error (RMSE: 0.1485) and a higher correlation (PCC: 0.7231) in allele-specific contexts. For pan-allele tasks, it excels at correctly ranking peptides with a superior Spearman's correlation (SCC: 0.7102), a crucial advantage for practical applications. The framework's design is inherently flexible, excelling in both allele-specific and pan-allele prediction tasks.
Asgary, A. H.; Aleyasin, A.; Mehl, J. A.; Fallah, S.; Aintablian, H.; Ludewig, B.; Mishto, M.; Liepe, J.; Soeding, J.
Accurate structural modeling of peptide-MHC (pMHC) complexes is a prerequisite for understanding adaptive immunity and developing data-driven immunotherapies. However, current tools are often limited by narrow class coverage, restricted peptide lengths, or insufficient accuracy for downstream design tasks. Here, we introduce PMGen (Peptide MHC Generator), an integrated framework for structure prediction and structure-guided design of variable-length peptides across MHC class I and II. By introducing Initial Guess and Template Engineering as strategies to enforce anchor constraints in AlphaFold2, PMGen achieves state-of-the-art structural fidelity with median peptide core RMSDs of 0.54 Å for MHC-I and 0.33 Å for MHC-II, outperforming five existing methods. We further demonstrate that PMGen captures the subtle structural impact of single-point neoantigen mutations and that model confidence (pLDDT) reliably correlates with structural accuracy. We investigated two potential applications of our framework: structure-aware peptide design and generating data for machine learning (ML) models. To this end, we introduced a framework to sample peptides with preserved structures and improved binding affinity. As an example ML application, we fine-tuned ProteinMPNN on PMGen-modeled structures. This improved sequence recovery from 0.19 to 0.40 compared to the baseline. Ultimately, PMGen bridges the gap between high-fidelity structural prediction and downstream sequence design, offering a scalable solution to generate the large-scale, high-quality structural datasets required to train advanced predictive models in immunology. Available at https://github.com/soedinglab/PMGen.
Jabin, A.; Ahmad, S.
Recent advances in large-scale self-supervised learning have led to the emergence of foundation models capable of extracting transferable visual representations from high-dimensional image data. In computational pathology, such models are increasingly used as feature encoders for molecular prediction tasks. However, systematic benchmarking of publicly available image foundation models for transcriptomic prediction from whole-slide images (WSIs) remains limited. Here, we perform a comprehensive evaluation of five state-of-the-art vision foundation models (DINOv2, Phikon, UNI, H-Optimus-0, and MedSigLIP) for gene expression prediction using the TCGA-BRCA cohort. Tile embeddings extracted from each model were aggregated via attention-based multiple instance learning (MIL), followed by multi-target regression to predict RNA-seq expression profiles. Performance was assessed using gene-level Spearman correlation across samples. Histopathology-specific foundation models consistently outperformed general-purpose encoders, with Phikon achieving the strongest overall performance, followed by UNI and H-Optimus-0. These findings demonstrate that domain-aligned pretraining substantially enhances morphology-to-transcriptome inference and provide a principled benchmark for foundation model selection in molecular pathology.
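The attention-based MIL aggregation step combines tile embeddings into one slide-level vector using softmax attention weights. A minimal sketch with toy 2-D embeddings and a dot-product scoring function standing in for the small learned attention network used in practice:

```python
import math

def attention_pool(tiles, query):
    # Score each tile against a query vector, softmax the scores, and return
    # the attention-weighted average as the slide-level embedding.
    scores = [sum(ti * qi for ti, qi in zip(t, query)) for t in tiles]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(tiles[0])
    pooled = [sum(w * t[d] for w, t in zip(weights, tiles)) for d in range(dim)]
    return pooled, weights

tiles = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]   # toy 2-D tile embeddings
query = [2.0, -1.0]                             # toy attention query
pooled, weights = attention_pool(tiles, query)
print([round(w, 3) for w in weights])           # weights sum to 1
```

Because the weights are learned rather than uniform, informative tiles dominate the slide representation, which is what lets a slide-level regressor predict expression without tile-level labels.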
Jia, Y.; Niu, J.; Qie, Z.; Li, Z.; Laine, A. F.; Guo, J.
Accurate classification of brain tumors from MRI is critical for guiding clinical decision-making; however, existing deep learning models are often hindered by limited interpretability and pronounced sensitivity to hyperparameter selection, which constrain their reliability in medical settings. To address these challenges, we propose TumorCLIP, a lightweight and training-efficient vision-language framework that integrates radiology-informed text prototypes with a DenseNet-based visual encoder to support clinically meaningful semantic reasoning, fused via a Tip-Adapter mechanism. TumorCLIP does not aim to introduce a new vision-language model architecture. Instead, its contribution lies in the integration of radiology-informed text prototypes tailored to MRI interpretation, a systematic evaluation of backbone stability across diverse visual architectures, and a lightweight, training-efficient CLIP-based fusion framework designed for medical imaging applications. We first conduct a comprehensive unimodal benchmark across eight representative visual backbones (EfficientNet-B0, MobileNetV3-Large, ResNet50, DenseNet121, ViT, DeiT, Swin Transformer, and MambaOut) using a standardized optimizer and learning-rate grid search, revealing performance swings exceeding 60 percentage points depending on hyperparameter choices. DenseNet121 shows the strongest stability-accuracy trade-off within our evaluated optimizer and learning-rate grid (97.6% accuracy). Leveraging this foundation, TumorCLIP fuses image features with frozen CLIP-derived text prototypes, achieving concept-level explainability, robust few-shot adaptation, and enhanced classification of minority tumor classes. On the test set, TumorCLIP attains 98.5% accuracy, including a +1.86 percentage point recall increase for Neurocytoma, suggesting that radiology-informed textual priors can improve semantic alignment and help refine diagnostic decision boundaries within the evaluated setting.
Additional evaluation on an independent external dataset shows that TumorCLIP achieves improved cross-dataset performance under the evaluated distribution shift, relative to the unimodal DenseNet121 baseline. These results demonstrate TumorCLIP as a practical, interpretable, and data-efficient alternative to conventional visual classifiers, providing evidence for radiology-aware vision-language alignment in MRI-based brain tumor classification. All results are reported within the evaluated datasets and training protocols.
Pandey, S.; Talo, M.; Siderovski, D. P.; Sumien, N.; Bozdag, S.
Identifying new therapeutic uses for existing drugs is a major challenge in biomedicine, especially for complex neurodegenerative conditions such as Alzheimer disease and related dementias (ADRD), where treatment options remain limited and relevant data are often sparse, heterogeneous, and difficult to integrate. Although general-purpose Large Language Model (LLM) embeddings encode rich semantic information, they often lack the task-specific biomedical context needed for inference tasks such as computational drug repurposing. We introduce Contextualizing LLM Embeddings via Attention-based gRaph learning (CLEAR), a multimodal representation-fusion framework that aligns LLM embeddings with the topological structure of a context-specific Knowledge Graph (KG). Across five benchmark datasets, CLEAR achieved state-of-the-art results, improving predictive performance (e.g., F1 score) by up to 30% over prior methods. We further applied CLEAR to identify FDA-approved drugs with potential for repurposing for ADRD, including Parkinson disease-related dementia and Lewy Body dementia. CLEAR learned a biologically coherent embedding space, prioritized leading ADRD drug candidates, and accurately summarized known therapeutic relationships for FDA-approved Alzheimer disease drugs. Overall, CLEAR shows that grounding biomedical LLM embeddings with context-specific KG signals can improve drug repurposing in data-sparse, real-world settings. GitHub: https://github.com/bozdaglab/CLEAR
Polster, M.; Stadelmaier, J.; Ball, E.; Scheid, J.; Bauer, J.; Nelde, A.; Claassen, M.; Dubbelaar, M. L.; Walz, J. S.; Nahnsen, S.
Mapping of T cell receptors (TCRs) to their cognate MHC-presented peptides (pMHC) is central for the development of precision immunotherapies and vaccine design. However, accurate prediction of TCR affinity to peptide antigens remains an open challenge. Most approaches rely solely on sequence information, although increasing evidence suggests that TCR-pMHC binding is primarily determined by three-dimensional structural interactions within the entire TCR-pMHC complex. Consequently, sequence-based methods often fail to generalize to peptides not included in the training data (unseen peptides). Here we introduce t2pmhc, a structure-based graph neural network framework for predicting TCR-pMHC binding using predicted structures of the entire TCR-pMHC complex. We evaluated a Graph Convolutional Network (GCN) and a Graph Attention Network, both demonstrating improved generalization to unseen peptides compared to state-of-the-art models across a variety of public datasets. Evaluation with crystallographic structures yields high-confidence predictions, indicating that current limitations of structure-based models are largely driven by the accuracy of structure prediction. Analysis of node attention patterns in t2pmhc-GCN reveals biologically consistent patterns, assigning high attention to the peptide and the CDR3 regions. Within the peptide sequence, canonical MHC anchor residues are consistently downweighted, whereas potential TCR-binding residues are upweighted. These findings establish t2pmhc as a structure-informed framework for robust TCR-pMHC binding prediction, enabling improved generalization to unseen antigens and providing a foundation for integrating TCR repertoire sequencing into vaccine design and immunotherapy.
Colangelo, G.; Marti, M.
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
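The two soft priors described, over the number of observed phenotypes per case and over phenotype specificity, translate naturally into a weighted sampling procedure. A minimal sketch with illustrative HPO-like terms and weights (not real HPO data, and not the authors' exact priors):

```python
import random

random.seed(0)  # reproducible sketch

gene_terms = {  # HPO-like term -> specificity weight (illustrative)
    "HP:ataxia": 0.9,
    "HP:seizure": 0.7,
    "HP:hypotonia": 0.5,
    "HP:abnormality_of_the_nervous_system": 0.1,
}

def simulate_case(terms, min_n=2, max_n=4):
    # Prior over case size: how many phenotypes this synthetic patient shows.
    n = random.randint(min_n, max_n)
    names, wts = list(terms), [terms[t] for t in terms]
    case = set()
    # Specificity prior: more specific (higher-weight) terms drawn more often.
    while len(case) < min(n, len(names)):
        case.add(random.choices(names, weights=wts)[0])
    return sorted(case)

print(simulate_case(gene_terms))
```

Restricting the candidate terms to each gene's local HPO neighborhood, as the text describes, is what keeps the synthetic phenotype-gene pairs inside the clinically plausible fraction of the combinatorial space.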
Wen, Y.; Xiong, J.; Gong, F.; Ma, L.; Wan, L.
Single-cell RNA sequencing combined with lineage tracing technologies provides rich opportunities to study development and tumor evolution, yet existing computational methods struggle to disentangle intrinsic transcriptional states from lineage-driven effects. We introduce DeepTracing, a deep generative framework that integrates disentangled representation learning with lineage-aware Gaussian processes to explicitly separate intrinsic cellular variation from lineage constraints. The model constructs a layered latent space and enforces independence via Total Correlation regularization, producing intrinsic, lineage, and unified embeddings. Across extensive benchmarks, DeepTracing consistently outperforms existing approaches. In TedSim simulations, it achieves superior clustering of cell states and effectively recovers phylogenetic structure, surpassing original expression and scVI. Applied to mouse tumor lineage-tracing data, DeepTracing attains higher ARI/NMI for tumor-type classification than scVI and PORCELAN, accurately separating primary and metastatic tumors and recovering known trajectories such as early lymph-node divergence and liver-to-kidney cross-seeding. In larger datasets, it maintains strong performance while preserving both transcriptomic continuity and lineage fidelity. DeepTracing also reconstructs continuous developmental trajectories in mouse ventral midbrain, isolating temporal effects from intrinsic differentiation. These results establish DeepTracing as a scalable and interpretable framework for analyzing multimodal single-cell data in tumor progression. Code availability: The source code is publicly available at https://github.com/Yuhong-Wen/DeepTracing.